How to Use PyTorchVideo
It looked quite useful, so here are some notes.
Using Pretrained Models
1. Loading a Model (via torch hub)
There is a dedicated hubconf.py on the master branch, so specify the path to it and load the model by name as a string.
import torch

model_name = "slow_r50"
path = 'path/to/directory/of/hubconf.py'
model = torch.hub.load(path, source="local",
                       model=model_name, pretrained=True)
Note that the loaded model has the type pytorchvideo.models.net.Net, so when integrating it into Lightning, store it as an attribute of your LightningModule.
2. Prepare Transforms to Convert Input Video to the Required Format
Match the video to the model's specifications. slow_r50 expects 256x256 input with normalized RGB. The number of frames per input also varies by model, so you need to specify it in advance.
For details, see here.
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    ShortSideScale,
    UniformTemporalSubsample
)
from torchvision.transforms import Compose, Lambda
from torchvision.transforms._transforms_video import (
    CenterCropVideo,
    NormalizeVideo
)
side_size = 256
mean = [0.45, 0.45, 0.45]
std = [0.225, 0.225, 0.225]
crop_size = 256
num_frames = 8
sampling_rate = 8
frames_per_second = 30
clip_duration = (num_frames * sampling_rate)/frames_per_second
transform = ApplyTransformToKey(
    key="video",
    transform=Compose(
        [
            UniformTemporalSubsample(num_frames),
            Lambda(lambda x: x / 255.0),
            NormalizeVideo(mean, std),
            ShortSideScale(size=side_size),
            CenterCropVideo(crop_size=(crop_size, crop_size))
        ]
    ),
)
3. Encoding the Video
If you have a video file ready, encoding can also be handled for you. .avi files work as well.
The steps are:
- Encode the video
- Clip by specifying seconds
- Pass through the transform
from pytorchvideo.data.encoded_video import EncodedVideo

sample_path = 'sample.avi'
# 1. Encode the video
video = EncodedVideo.from_path(sample_path)
# 2. Clip by specifying seconds
video_clip = video.get_clip(start_sec=0, end_sec=10)
# 3. Pass through the transform
video_data = transform(video_clip)
4. Feeding Input to the Model
The transformed video is a dictionary: the "video" key gives you the video tensor of shape (C, T, H, W), "audio" gives you the audio, and "video_name" gives you the original path.
The model takes a tensor of shape (batch_size, C, T, H, W), so:
inputs = video_data['video']
prediction = model(inputs.unsqueeze(0))
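The raw output is a vector of class logits; slow_r50 is trained on Kinetics-400, so there are 400 of them. A sketch of typical post-processing, using fake logits in place of the real prediction above:

```python
import torch

# Fake (1, 400) logits standing in for the real `prediction` above
prediction = torch.randn(1, 400)
# Convert logits to probabilities, then take the 5 most likely classes
probs = torch.nn.functional.softmax(prediction, dim=1)
top5 = probs.topk(k=5)
print(top5.indices)  # ids of the top-5 predicted Kinetics classes
```

The indices can then be mapped back to human-readable labels with a Kinetics-400 class-name file.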
Bonus: Integrating with Lightning
There are two things to do:
- Integrate the model
- Write a DataModule (Transform + label assignment)
import pytorch_lightning
import torch
import torch.nn.functional as F

class VideoClassification(pytorch_lightning.LightningModule):
    def __init__(self, path, model_name="slow_r50"):
        super().__init__()
        self.model = torch.hub.load(path, source="local",
                                    model=model_name, pretrained=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-1)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        y = self.model(batch["video"])
        t = batch["label"]
        loss = F.cross_entropy(y, t)
        return loss
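The loss in training_step is plain cross-entropy over (batch, class) logits. A standalone sketch with fake data, assuming the slow_r50 / Kinetics-400 shapes from earlier:

```python
import torch
import torch.nn.functional as F

# Fake batch: logits for 4 clips over 400 classes, plus integer class labels
y = torch.randn(4, 400)
t = torch.randint(0, 400, (4,))
loss = F.cross_entropy(y, t)
print(loss)  # scalar tensor, which Lightning backpropagates
```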
from pytorchvideo.transforms import (
ApplyTransformToKey,
RandomShortSideScale,
RemoveKey,
ShortSideScale,
UniformTemporalSubsample
)
from torchvision.transforms import (
    Compose,
    Lambda,
    Normalize,
    RandomCrop,
    RandomHorizontalFlip
)

import os
import pytorchvideo.data
import torch

class KineticsDataModule(pytorch_lightning.LightningDataModule):
    def setup(self, stage=None):
        train_transform = Compose(
            [
                ApplyTransformToKey(
                    key="video",
                    transform=Compose(
                        [
                            UniformTemporalSubsample(8),
                            Lambda(lambda x: x / 255.0),
                            Normalize((0.45, 0.45, 0.45), (0.225, 0.225, 0.225)),
                            RandomShortSideScale(min_size=256, max_size=320),
                            RandomCrop(244),
                            RandomHorizontalFlip(p=0.5),
                        ]
                    ),
                ),
            ]
        )
        self.train_dataset = pytorchvideo.data.Kinetics(
            data_path=os.path.join(self._DATA_PATH, "train.csv"),
            clip_sampler=pytorchvideo.data.make_clip_sampler("random", self._CLIP_DURATION),
            transform=train_transform
        )

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            self.train_dataset,
            batch_size=self._BATCH_SIZE,
            num_workers=self._NUM_WORKERS,
        )